Algebraic specification has a long tradition in bridging the gap between specification and programming
by making specifications executable. Building on extensive experience in designing, implementing
and using specification formalisms that are based on algebraic specification and term rewriting
(namely ASF and Asf+Sdf), we are now focusing on using the best concepts from algebraic
specification and integrating these into a new programming language: RASCAL. This language is
easy to learn by non-experts but is also scalable to very large meta-programming applications.

We explain the algebraic roots of RASCAL and its main application areas: software analysis, soft- 
ware transformation, and design and implementation of domain-specific languages. Some example 
applications in the domain of Model-Driven Engineering (MDE) are described to illustrate this. 

1 Introduction 

Algebraic specification has a long tradition in bridging the gap between specification and programming
[22, 18]. There has always been a tension between algebraic specifications as mathematical objects
with certain properties and algebraic specifications as executable objects. This tension is nicely summa- 
rized by the label "algebraic programming". 

Experience has taught us that when a formalism is made executable it effectively becomes a pro- 
gramming language. Even if the language operates on a higher level of abstraction, common engineering 
issues arise when developing and maintaining specifications. Like programs, executable specifications 
have bugs and thus require debugging; they are slow and thus need to be optimized; they are complex 
and thus need to be analyzed in order to be understood; and finally their life extends beyond the first 
version and thus they need to be maintained to accommodate new requirements. The nature of alge- 
braic specification exacerbates the difficulty of some of these common software engineering tasks. This 
is due to the inherent non-deterministic nature of algebraic specification and the complexity of highly 
optimized execution platforms (term rewriters). What actually happens at run-time, and why and when, 
is conceptually far removed from what is specified. 

We describe the language RASCAL we are currently working on. It is a dedicated language for
meta-programming (Figure 1). This means that programs can be the input and output of RASCAL programs.
Rascal's primary applications are in software analysis, software transformation and design and im- 
plementation of domain-specific languages. The word "software" should here be interpreted in a broad 
sense: subjects for analysis and transformation include source code, models and meta-data such as doc- 
umentation, version histories, bug trackers, log files, execution traces, and more. RASCAL is rooted in 
algebraic programming and is targeted at solving large, real-life, problems. 

The goal of this paper is to explain why and how RASCAL is not an algebraic specification formalism 
with programming language features, but rather a programming language with algebraic specification 

Francisco Duran and Vlad Rusu (Eds.): Second International Workshop on Algebraic Methods
in Model-Based Software Engineering 2011 (AMMSE'11),
EPTCS 56, 2011, pp. 15-32, doi:10.4204/EPTCS.56.2



16 Rascal: From Algebraic Specification to Meta-Programming 



[Figure 1 diagram. Recoverable labels: Code, Model, Picture; Generation, Visualization, Formalization; Transformation, Analysis, Conversion.]



Figure 1: The meta-programming domain: three layers of software representation with transitions. 

features. The plan of the paper is to first summarize in Section 2 the lessons we have learned in designing
and applying several languages for algebraic programming. These lessons form the starting point for
RASCAL's design requirements. We sketch the resulting language and explain where it deviates from the
purely algebraic paradigm in Section 3. In Section 4 we illustrate RASCAL by detailing three applications
in the domain of Model-Driven Engineering (MDE), one of the prime application areas of RASCAL, as
well as one linking RASCAL with existing algebraic specifications. We conclude in Section 5.

2 An Algebraic Perspective to Meta-Programming 

Rascal succeeds Asf+Sdf [11] as our platform for experimenting with and implementing language
definitions and other meta-programs. Asf+Sdf consists of two parts: ASF, the Algebraic Specification 
Formalism (used for rewriting terms), and Sdf, the Syntax Definition Formalism (for specifying
grammars). Below we summarize, with hindsight, our experiences with Asf+Sdf that have motivated the
design of RASCAL. We first focus on ASF and after that specifically address the lessons we have learned 
from its combination with Sdf. 

2.1 Experience with Asf+Sdf: the case of ASF

Our first focal point is ASF, which originally was a standalone formalism (ASF, [2]):

• An ASF specification is modular. Modules import each other, while optionally instantiating sort 
parameters and/or renaming sorts. 

• Each module contains function signatures that declare (first order) typed functions. Functions can 
be either constructors (i.e. function names that can occur in normal forms) or defined functions 
(i.e. functions that will be eliminated by applying equations). Originally, the add function on 
natural numbers (represented by the sort NAT) was declared in ASF as: add : NAT # NAT -> NAT. 
The latest incarnation of Asf+Sdf uses Sdf [19, 33] to define signatures (cf. below).

• An equation in ASF consists of an equality between two terms that respects the declared
many-sorted function signatures. Optionally, this equality may be preceded by one or more conditions
that can be equalities or inequalities between terms. A conditional equality does not imply full 
unification: only one side of a positive condition may introduce variables and inequalities may not 
introduce variables at all. 

• Equations may be marked as "default" equations that apply if no other relevant equations apply. 

• Equations can use list matching — pattern matching modulo associativity of list construction — 
facilitating handling programming language constructs like statement and parameter lists. 

The add function mentioned above could be defined in ASF using the following equations (assuming 
appropriate definitions for the NAT constant and the successor function succ): 
[add1] add(X, 0) = X

[add2] add(X, succ(Y)) = succ(add(X, Y))
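To make the execution model concrete, the two equations can be read as left-to-right rewrite rules applied innermost-first. The following is a minimal sketch in Python, not ASF; the tuple encoding of terms and the names `ZERO`, `succ`, `add`, `rewrite_root`, `normalize` and `from_int` are assumptions made for this illustration only.

```python
# Peano naturals and add-terms as nested tuples (an encoding chosen for this sketch):
#   ("zero",)   ("succ", n)   ("add", x, y)
ZERO = ("zero",)

def succ(n):
    return ("succ", n)

def add(x, y):
    return ("add", x, y)

def rewrite_root(term):
    """Apply [add1] or [add2] at the root; return None if neither matches."""
    if term[0] == "add":
        x, y = term[1], term[2]
        if y == ZERO:             # [add1] add(X, 0) = X
            return x
        if y[0] == "succ":        # [add2] add(X, succ(Y)) = succ(add(X, Y))
            return succ(add(x, y[1]))
    return None

def normalize(term):
    """Innermost normalization: reduce all arguments first, then the root."""
    term = (term[0],) + tuple(normalize(t) for t in term[1:])
    reduced = rewrite_root(term)
    return term if reduced is None else normalize(reduced)

def from_int(n):
    """Helper for building test terms: 2 becomes succ(succ(zero))."""
    return ZERO if n == 0 else succ(from_int(n - 1))

two_plus_one = normalize(add(from_int(2), from_int(1)))
```

A term for which `rewrite_root` returns `None` is a normal form; in this sketch only constructor terms built from `zero` and `succ` remain after normalizing a closed `add` term.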

The design and use of ASF have always been focused on executability by way of (left-most
innermost) term rewriting. Initial implementations of ASF compiled to Prolog, later ones (in the context of
Asf+Sdf) to highly efficient C code. As part of Asf+Sdf, ASF has been successfully used for the
analysis and transformation of multi-million line software systems and for the implementation of industrial
domain-specific languages [9]. These experiences have led to the following observations:

• Although Asf allows arbitrary rewrite rules, programmers almost without exception write strongly 
confluent and terminating sets of rules. They do that by introducing enough intermediate function 
symbols and by strictly using default rules when not all cases need to be matched by a function 
symbol. In other words, a locally non-confluent specification is almost always considered to be 
buggy rather than simply declarative. 

• Programmers in the meta-programming domain write specifications under the assumption of left- 
most innermost reduction. By doing this they use ASF as a first-order functional programming 
language with advanced pattern matching features. 

• Practically all bugs in ASF specifications are caused by non-matching terms in conditions and 
therefore non-reducing terms that were supposed to be reduced. Debugging a specification amounts 
to carefully simplifying an input term to the smallest possible term that triggers a bug, then running 
the specification and locating the offending rewrite rule. 

• Term rewriting can be implemented extremely efficiently and scales to big applications in
meta-programming. Much of the efficiency is caused by maximally sharing sub-terms, which allows
small memory footprints and equality checking in O(1) [7, 10]. This also implies immutability
of data values at run-time. 

• For large-scale meta-programming, which implies signatures of hundreds of constructor functions 
for the abstract syntax trees of programming languages, simple recursion over deep terms needs to 
be automated. Many meta-programs are "structure shy": they only apply to some node types of the 
abstract syntax and such nodes may be buried deep in a term. We have extended ASF with so-called 
traversal functions [8] to facilitate automatic type-safe traversal. This feature commonly reduces
the size of an ASF application, sometimes by up to 95% depending on the size of the language.

• Rule-based programming is not for everyone: it requires special training and experience to use 
effectively. Programmers with a formal computer science background have no trouble using ASF, 
while programmers without such background have difficulty adapting to this paradigm. They are 
surprised by the fact that simple things, like iteration over a list, may require two or more non- 
trivial rewrite rules or the use of a complicated list pattern, while other operations that are complex 
in a normal programming language are suddenly completely trivial. As a result, their previously 
acquired engineering skills seem useless. 

• Text-book algorithms for static analysis and program optimization are not easily translated into 
the algebraic paradigm. Instead one must re-think these algorithms and visualize their effect in 
run-time and memory consumption as the execution platform executes them. This implies that to 
start analyzing and transforming programs, one must first re-think the basics. Although this is in- 
teresting from an academic perspective, from the software engineering perspective this represents 
a negative investment. 
• Sets of rewrite rules and algebraic signatures are open for extension. One can add a new function
to a certain sort and simply add alternatives for all the functions that process that sort. Example:
we add a "do-while" construct to a language and then we add new rewrite rules for the extended
definition of a control-flow graph extractor. This is called open extensibility: functionality can be
extended in a modular fashion without changing existing code.
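The maximal sharing of sub-terms mentioned in the observations above can be illustrated with a small hash-consing table. This is a sketch in Python under our own assumptions (the class name `TermFactory` and its `make` method are hypothetical); it is not how the ASF implementation is written, and it ignores garbage collection of unused terms.

```python
class TermFactory:
    """Builds immutable terms so that structurally equal terms are one object."""

    def __init__(self):
        self._table = {}

    def make(self, name, *children):
        # Children were built by this factory, so they are already shared:
        # their object identities identify them uniquely. The table keeps
        # every term alive, so identities are never reused.
        key = (name, tuple(id(c) for c in children))
        term = self._table.get(key)
        if term is None:
            term = (name, children)      # tuples are immutable
            self._table[key] = term
        return term

    def size(self):
        """Number of distinct terms stored."""
        return len(self._table)


factory = TermFactory()
zero = factory.make("zero")
a = factory.make("succ", factory.make("succ", zero))
b = factory.make("succ", factory.make("succ", zero))
```

Because `a` and `b` are the same object, equality reduces to the identity test `a is b`, which takes constant time regardless of term size. Sharing is only safe because terms are never mutated, which is the immutability the text refers to.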

We have used ASF as a high-level programming language, applying it to different forms of meta- 
programming. We had to extend algebraic specification with default rules and traversal functions to 
obviate the need for large amounts of boilerplate rewrite rules. Unfortunately, as our student influx be- 
came less formally educated, we could not keep using ASF as a vehicle for education in software analysis 
and transformation. 
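The idea behind traversal functions can be sketched as a generic bottom-up walk that takes a single "interesting" rule and applies it wherever it matches, so that no rule has to be written for the other node types. The sketch below is in Python with terms encoded as tuples; `bottom_up`, `rename_x_to_y` and the example program are assumptions of this illustration, not ASF notation.

```python
def bottom_up(term, rule):
    """Rebuild `term` bottom-up, applying `rule` at every node.

    `rule` returns a replacement term, or None to leave the node unchanged.
    """
    if isinstance(term, tuple):
        term = (term[0],) + tuple(bottom_up(child, rule) for child in term[1:])
    replacement = rule(term)
    return term if replacement is None else replacement


def rename_x_to_y(term):
    """The only case we care about: occurrences of the variable x."""
    if term == ("var", "x"):
        return ("var", "y")
    return None


program = ("while", ("lt", ("var", "x"), ("num", 10)),
           ("assign", ("var", "x"), ("plus", ("var", "x"), ("num", 1))))
renamed = bottom_up(program, rename_x_to_y)
```

The rule mentions only `var` nodes; `while`, `lt`, `assign`, `plus` and `num` are handled by the generic traversal. With a signature of hundreds of constructors, this is where the reduction in boilerplate comes from.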

2.2 Experience with Asf+Sdf: the case of Sdf 

Since the original goal of Asf+Sdf was describing programming languages, it includes a built-in facility
for describing syntax: the Syntax Definition Formalism (SDF [19]). Naturally, this combination of
parsing and rewriting makes Asf+Sdf especially apt for the domain of meta-programming. Parsing
the input source code enables all further analysis and transformation. The essential characteristics of
Asf+Sdf derived from Sdf are:

• Function signatures correspond to context-free grammar rules, similar in semantics to EBNF. For
example, the add function is now declared as NAT "+" NAT -> NAT. Appropriate priorities and
associativity can also be defined. A non-terminal is a sort, a grammar rule is a function. Fur- 
thermore, Sdf supports variable definitions, a class of non-terminals specifically tagged to be 
meta-variables. 

• In Asf+Sdf equations we write concrete syntax patterns instead of prefix term patterns. Any
production rule in a context-free grammar is a term constructor. The syntax of terms is completely 
user-defined. 



• Sdf integrates lexical and context-free syntax definitions to generate scannerless parsers [33].
This helps in broadening the scope of programming languages that can be accepted, for example 
to allow the analysis and transformation of languages that do not have a separate tokenizer or to 
allow parsing of embedded languages that have different sets of reserved keywords in different 
contexts. 

• When modules are combined by way of import, the signatures that they declare are merged. In 
the case of Sdf this means composition of complete grammars. Because only the full class of 
context-free grammars is closed under composition, Sdf is supported by a parser which supports 
all context-free grammars, rather than a subset like LALR(1) or LL(1).

• List matching is implemented for regular expressions in Sdf notation, e.g., terms of type 
Statement*, which corresponds to the language of possibly empty lists of statements, may be 
matched using arbitrary list patterns. 

Using Sdf as a front-end to ASF, the add example can now be written as follows: 

[add1] X + 0 = X

[add2] X + succ(Y) = succ(X + Y)

Note both how the concrete syntax of the add function (+) can be used in the equations without quotation, 
and the use of X and Y as meta-variables ranging over the non-terminal for recognizing natural numbers. 
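List matching, as described in the list above, differs from ordinary pattern matching in that one pattern can match the same list in several ways. A minimal sketch in Python, assuming a pattern of the shape "any prefix, one element, any suffix"; the function name `match_around` and the example statements are hypothetical.

```python
def match_around(items, predicate):
    """Yield (before, x, after) for every split of `items` whose middle
    element satisfies `predicate`, mimicking a list pattern  L1, X, L2."""
    for i, x in enumerate(items):
        if predicate(x):
            yield items[:i], x, items[i + 1:]


statements = ["decl a", "goto L1", "print a", "goto L2"]
matches = list(match_around(statements, lambda s: s.startswith("goto")))
```

Here the pattern matches twice, once per `goto`. The sketch enumerates all matches; which one a rewrite engine commits to is a separate design decision that this example does not model.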
Work on Asf+Sdf has been documented in [11]. Writing and executing Asf+Sdf specifications is
supported by the Asf+Sdf Meta-Environment [24, 7, 6]. The following list of observations summarizes
our experience in using Asf+Sdf, this time focusing on the consequences of using Sdf:

• Parsing and term rewriting are the only main features of Asf+Sdf, and they are inseparable. 
This is both a strength and a weakness. The strength is conceptual simplicity and expressivity. 
The weakness is that one must understand both together and this makes the learning curve steep. 
New users struggle to conceptually separate parsing from rewriting when confronted with a bug or 
unexpected behavior. 

• The upside of modular grammars is their unlimited composability. The downside is that no guar- 
antee can be given as to whether the composed grammar is unambiguous. Solving ambiguities is 
difficult and requires expert knowledge [1].

• Since a signature is defined by a context-free grammar in Asf+Sdf, any type error (providing
the wrong type of argument to a function) results in a parse error. This is rather uninformative,
and can be especially confusing to new users.

• Some typical programming languages have non-context-free syntaxes. COBOL files for example 
have "margins" that are line based, while inside the margins the syntax is not line based. Con- 
sequently, it is impossible to parse such languages using just context-free grammars. Users of 
Asf+Sdf have written preprocessors in scripting languages such as Perl and Python to remove
margins or indentation and put them back later. 

• The meta-programming paradigm requires "high fidelity" in rewriting source code to source code. 
Otherwise unimportant details such as whitespace and source code comments need to be retained 
in meta programs that transform existing software systems. To facilitate high fidelity source- 
to-source transformation, we have extended the ASF execution engine to rewrite full parse trees 
instead of abstract syntax trees. 

• All data that is processed must first be specified as a context-free grammar. For example, the
Asf+Sdf to C compiler contains a grammar for a subset of ANSI C and for an intermediate
pattern matching automaton. The COBOL control flow visualization tool contains a grammar of
graphviz's dot formalism. The Asf+Sdf library even contains several definitions of XML.
Defining good grammars is hard work, but in Asf+Sdf there is no way around it.

• Program analyses frequently require the representation of graphs, such as control flow graphs or 
data dependency graphs. These can be easily encoded as parse trees (which contain tree nodes 
and lists), but this representation induces a significant loss of efficiency. Furthermore, operations 
on sets, relations and graphs are encoded as traversals over lists, with similar loss of efficiency in 
computation. 

• What you see is what you get. Since in Asf+Sdf all data is a parse tree, simply unparsing it
renders a complete and readable representation of all input, output and even intermediate data 
structures. This helps in making complex algorithms debuggable. 

The combination of context-free grammars and term rewriting is very powerful, but is not without its 
drawbacks. All data is required to be described by a context-free grammar, hence parsing is a first and 
unavoidable step in creating any meta-programming tool. Input, output and intermediate representations 
are limited by context-free grammars: efficient representations of data that is more complex or simpler 
are not available. 




2.3 Lessons learned from other formalisms 



The strategic programming language Stratego [34] was motivated by similar experiences with Asf+Sdf.
Its design aims to keep the intention of rewrite rules as algebraic equalities, but on top of that introduces
expressive rewrite strategies to compose them. In other words, Stratego extended algebraic programming
with higher-order parameterized rule application. Stratego's strategies, which derived from the rewriting
strategies in ELAN [4], are a true first-class programmable feature rather than a conservative extension
of algebraic specifications. In Stratego, the strategies drive the computation, not the rewrite rules. We 
noticed that in most meta-programming applications of Stratego the strategies rather than the rewrite 
rules do the heavy lifting. This again emphasizes the programming rather than the specification features. 

TXL [16] is a functional programming language intended to implement language extensions and
software transformations. It has surprising similarities to Asf+Sdf, but does not have its roots in algebraic
specification. This is an eye opener. Although somewhat different, pattern matching, substitution, traver- 
sal, and BNF rules are all features of TXL, while it does not feature algebraic equations that are applied in 
a non-deterministic fashion. TXL is also known to be very successful in the area of meta-programming, 
so we may hypothesize that key components of algebraic programming are success factors in the meta- 
programming domain, rather than the whole integrated concept of algebraic specification. 

Among other things, from ELAN [4] and Maude [14] we have learned how ACI matching (associative
and commutative matching with defined identities) can be used to express sets and relations. Maude
is known to be strong in analysis algorithms, such as model checking, while Asf+Sdf is naturally
stronger in transformation-intensive applications. We have also experimented with a language called
RScript [25], inspired by relational analysis tools such as Grok [23] and Crocopat [3], to verify that
explicit set and relation operators would match the software analysis application domain in combination
with fact extraction implemented directly in Asf+Sdf.
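To give an impression of why explicit set and relation operators fit software analysis, the sketch below models a relation as a set of pairs and answers a few questions about a call graph. It is written in Python rather than RScript; the operator names `compose`, `invert` and `domain`, and the example facts in `calls`, are assumptions made for this illustration.

```python
def compose(r, s):
    """Relational composition: (a, c) whenever (a, b) in r and (b, c) in s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def invert(r):
    """Swap the two columns of a binary relation."""
    return {(b, a) for (a, b) in r}


def domain(r):
    """All elements that occur in the first column."""
    return {a for (a, _) in r}


# Extracted facts: who calls whom (hypothetical example data).
calls = {("main", "parse"), ("main", "eval"), ("eval", "lookup")}

indirect = compose(calls, calls)                       # calls via one intermediary
callers_of_lookup = {a for (a, b) in calls if b == "lookup"}
never_called = domain(calls) - domain(invert(calls))   # entry points
```

Each query is a one-line expression over the extracted facts. Encoding the same relation as a parse tree and traversing lists, as the observations in Section 2.2 describe, would need explicit recursion for each of them.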

Another source of inspiration is ANTLR [30]. Although ANTLR does not support all context-free
grammars, it shines in its applicability and popularity among meta-programmers. This is caused by the 
perspective that meta-programs need to be included in a bigger software engineering context. ANTLR 
ensures by the design and implementation of the code it generates that ANTLR-based tools can be seam- 
lessly integrated in ordinary software projects. A reason for this is that the code that is generated is 
almost what the programmer would write manually. ANTLR connects to the background of advanced 
software engineers rather than to the background of computer scientists. This is another eye-opener. 

3 Rationale for the Design of Rascal 



Rascal's goal is to cover the domain of meta-programming as a whole (see Figure 1). We now first
enumerate which Asf+Sdf features are desirable to keep and which to avoid. Then we present an
overview of the features of the language. Based on our experience, we came to the conclusion that the
following Asf+Sdf features are desirable:

• Context-free grammars (and scannerless parsing) for the modular definition of the syntax of real 
programming languages. 

• Pattern matching (and list matching), for finding patterns in programs. 

• Pattern matching for dispatch over language constructs to obtain open extensibility. 

• Default definitions to prevent boilerplate completion of alternatives. 

• Automated traversal to support structure-shy applications. 
• Concrete syntax for matching and construction of source code fragments.

• Immutability of data to facilitate efficient rewriting and for a safe programming environment. 

• "what you see is what you get". Similar to the parse trees of Asf+Sdf, all data should have a 
standard, complete and human-readable serialized representation. This notation should coincide 
exactly with the notation for expressions in RASCAL. 

We also concluded that the following features are undesirable: 

• Non-determinism in dispatch over language constructs. Meta-programs are mostly deterministic. 
So, simple rewrite rule semantics should be restricted. 

• The necessity of defining a context-free grammar for every kind of data. We want to re-introduce
abstract data-types as a separate feature and have the possibility to compute directly with basic
data-types such as strings, reals and integers.

• The paradigm of sets of rewrite rules is too exotic for many software engineers. 

• The all-or-nothing experience of Asf+Sdf. We need a language that can be introduced
feature-by-feature, starting from a simple and understandable (procedural) basis.

• Type-checking by parsing is confusing. 

We have now explained why RASCAL should be different. Now we explain how it is different. 
Starting from the aforementioned features of Asf+Sdf we reorganized them into separate, independent
layers and have added features if we considered them missing. These ingredients were then synthesized
into the language design by an iterative design process, in which we reviewed a number of key use cases.
These were static analysis algorithms, source-to-source transformations, type-checkers and source-code
generators. The resulting language is RASCAL [27, 26].

Rascal is organized in a core layer which contains basic data-types (booleans, integers, reals, 
source locations, date-time, lists, sets, maps, relations), structured control flow (if, while, switch, for) 
and exception handling (try, catch). To use the core you must understand that all data is immutable and 
that all code is statically typed. From this point of view, RASCAL looks like a simple general purpose 
programming language with built-in, immutable data structures. 

The following is a list of features that can be learned on a "need to know" basis. The layers are 
progressively more domain specific to the meta-programming domain: 

• List, set, and map comprehensions for the construction of SQL-like queries and analyses. This 
includes the <- element generation operator, which can enumerate the elements of all container
data-types, like lists, sets, maps, and trees. The same operator is used in for loops. 

• Algebraic data type definitions for the definition of (intermediate) abstract data-types. These are 
similar to the data type facilities in functional programming languages. 



• Advanced pattern matching operators, like deep match (/), negative match (!), set matching and
list matching. These can, for instance, be used in switch cases, for loops and comprehensions.

• String templates with margins and an auto-indent feature. The margins of strings allow one to
indent a template with the nesting depth of the RASCAL program while embedding a multi-line
source code template. This enhances the formatting of RASCAL template-based code generators.



¹ http://www.rascal-mpl.org
• A visit statement, which is an extension of switch that traverses arbitrarily nested data in order to
perform structure-shy analysis and transformation. Cases of a visit may substitute in place and/or 
have side effects on (local) variables by executing arbitrary RASCAL code. Visit is parameterized 
by a traversal strategy to allow different traversal orders. 

• A solve statement for fixed-point computation. 

• Syntax definitions using an EBNF-like notation for generating parsers. This includes disambigua- 
tion facilities. 
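The fixed-point computation listed above can be pictured as repeating a step until nothing changes. The following Python sketch uses that idea to compute a transitive closure; the helper names `solve` and `transitive_closure` are our own, and the sketch makes no claim about how RASCAL's solve statement is implemented.

```python
def solve(step, initial):
    """Iterate `step` until the value stops changing (a fixed point)."""
    current = initial
    while True:
        following = step(current)
        if following == current:
            return current
        current = following


def transitive_closure(r):
    """Smallest transitive relation containing the finite relation `r`."""
    def step(t):
        # Extend every known path (a, b) by one edge (b, c) of the base relation.
        return t | {(a, c) for (a, b) in t for (b2, c) in r if b == b2}
    return solve(step, set(r))


edges = {(1, 2), (2, 3), (3, 4)}
closure = transitive_closure(edges)
```

The loop terminates here because each step only adds pairs drawn from a finite set. Many textbook data-flow analyses have the same shape: a monotone step function iterated over sets or relations until they stabilize.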

Additionally, RASCAL is designed specifically to help the programmer create safe, modular and 
generic meta-programs in the following ways: 

• Type inferencing for local variables in functions. Formal parameters and return types of functions 
must be explicitly typed. This prevents typing errors from leaking between function definitions. 

• There is no down-cast operator. Instead all down conversions are done by matching. To use a 
variable that is bound by a match, the programmer must include the match in a conditional context, 
such as an if, for, switch, visit or comprehension to ensure that in the body of that construct the 
variables are bound. As a result, RASCAL programs have no "ClassCastException"-like run-time 
exceptions. 

• Lexically scoped backtracking. Each body of a conditional statement or expression that uses non- 
deterministic pattern matching may use the fail statement to undo the effects of the current scope 
and jump to the next available match. 

• The formal parameters of a function may also be arbitrary patterns, like the left-hand sides of 
rewrite rules. Each alternative for a certain function name must have mutually exclusive patterns. 
If this can not be realized, one of the alternatives must have the default modifier to indicate that 
it will be tried only after the other patterns fail. This gives us open extensibility: add a rule in a 
syntax definition or a data definition and you can add an alternative definition for any function that 
operates on that type. 

• Rascal's sub-typing lattice supports a number of layers that allow algorithms to work on different 
levels of generality. This is complementary to having type parameters for generic functions and 
type-parameterized abstract data-types. The value type is the top type. Algorithms that do not
assume anything about a value use this type. The void type is the bottom type. This is an example 
of a longest possible sub-type chain: void < Statement < Tree < node < value. In this chain 
value, node and void are built-in, while the others are defined in Rascal. The node type repre- 
sents the common super-type of all abstract data-types, allowing access to and modification of the 
names and children of constructors. Tree is a library definition of an ADT for all parse trees and 
Statement is a defined non-terminal from a syntax definition for some programming language like 
C or Java. Functions may operate on each of these 5 levels, with the level chosen based on the 
amount of detail about the parameter that is needed. Another benefit is that the implementation of 
parse trees, which are central to meta programming, is completely transparent to the programmer. 

To summarize, Rascal is a value-based procedural programming language with high-level
domain-specific built-in operators. These operators come from algebraic programming and relational calculus. We
realize that this is not a formal definition nor a full explanation of the language. It should, however, be a 
good starting point for getting an impression of RASCAL and its design rationale. Details of Rascal may 
change as we get more feedback, and it may be extended, but this design will not change. 
Example. Getting back to our running example, there are many ways in which the add example can be
rewritten in Rascal.² One is to use a switch statement for case distinction:³

NAT add1(NAT x, NAT y) {
  switch (y) {
    case z(): return x;
    case succ(NAT y): return succ(add1(x, y));
  }
}
This is a traditional (mostly imperative) programming style which is fully supported. Another way of 
writing this same example in RASCAL is to write a function for each case. Note that RASCAL generalizes 
the notion of a function signature from a list of typed variables to a list of patterns that may contain 
(possibly deeply nested) variables. Pattern-matching at the call site determines which version of the 
function is actually called (pattern-directed invocation). The two functions for defining add are then
written as: 

NAT add2(NAT x, z()) { return x; }

NAT add2(NAT x, succ(NAT y)) { return succ(add2(x, y)); }
We can come even closer to algebraic equations, since functions that return a single expression can be
abbreviated as follows: 

NAT add2(NAT x, z()) = x;

NAT add2(NAT x, succ(NAT y)) = succ(add2(x, y));
This example illustrates that one can stay close to algebraic specifications, but that mixtures of algebraic 
style and imperative style are supported as well. This is very convenient when mixing, for instance, 
axiom-based simplification rules with more imperative symbol table handling. 
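Pattern-directed invocation, together with the default modifier described in Section 3, can be approximated in a language without pattern parameters by a table of alternatives. The Python sketch below is an illustration only: the class `PatternFunction`, its `case` and `default` decorators, and the tuple encoding of NAT values are assumptions of this example. Unlike RASCAL, which requires non-default alternatives to be mutually exclusive, the sketch simply tries alternatives in registration order.

```python
class PatternFunction:
    """A function defined by alternatives. Each alternative has a matcher that
    returns a tuple of bindings on success, or None when it does not match."""

    def __init__(self):
        self._alternatives = []
        self._default = None

    def case(self, matcher):
        def register(body):
            self._alternatives.append((matcher, body))
            return body
        return register

    def default(self, body):
        self._default = body
        return body

    def __call__(self, *args):
        for matcher, body in self._alternatives:
            bindings = matcher(*args)
            if bindings is not None:
                return body(*bindings)
        if self._default is None:
            raise TypeError("no alternative matches")
        return self._default(*args)   # tried only after all patterns fail


add = PatternFunction()

@add.case(lambda x, y: (x,) if y == ("z",) else None)           # add(x, z())
def _(x):
    return x

@add.case(lambda x, y: (x, y[1]) if y[0] == "succ" else None)   # add(x, succ(y))
def _(x, y):
    return ("succ", add(x, y))

@add.default
def _(x, y):
    return ("stuck", x, y)


one = ("succ", ("z",))
two = ("succ", one)
three = add(two, one)
```

A later module could register a further alternative with `add.case` without touching the definitions above, which is the open extensibility the text describes.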

Rascal provides lists (with associative matching) and sets (with associative and commutative
matching), further strengthening the algebraic flavor. Although RASCAL remains true to its algebraic roots, the
overall feeling of the language is that of a programming language rather than a specification language. 
This is not only because we opted for a Java-like notation, but also because we have packaged con- 
cepts differently and have introduced some non-algebraic concepts like, most notably, global and local 
variables, comprehensions, and standard control flow. A very simple example can illustrate this. We 
define binary trees with integers as leaves and composite nodes that specify the color of the node and two 
subtrees. This can be defined as follows: 

data ColoredTree = leaf(int n)
                 | composite(str color, ColoredTree left, ColoredTree right);

Next we want to analyze a ColoredTree and compute a frequency distribution of the colors used in
composite nodes. We use a map from strings to integers to maintain the frequencies and a local
variable counts to maintain this map. The automatically inferred type of counts is map[str, int]. A
visit statement is used that traverses an arbitrary data structure, matches the patterns for the cases to
all substructures, and executes the case when a match is found. The statement counts[color] ? 0 += 1
increments the current frequency count for the given color if it exists; otherwise it uses 0 as the
default and increments that. Note how this affects the value of the local variable counts. The RASCAL
code is shown in Listing 1.



² This example is for comparison only and is atypical since, unlike Asf+Sdf, RASCAL has built-in,
arbitrary-length integers and reals.

³ We use the constructor z() to represent 0.
Listing 1 Counting frequencies of colors in a ColoredTree
public map[str, int] colorDistribution(ColoredTree t) {
  counts = (); // initialize an empty map

  visit(t) { // all leaves and composite nodes in the tree
    case composite(str color, _, _):
      // for each composite node: increment count for color
      // (use 0 as default when not yet in table)
      counts[color] ? 0 += 1;
  }

  return counts;
}
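
For readers more at home in mainstream languages, the same analysis can be sketched in Python. This is an illustrative analogue only, not RASCAL code: the class names Leaf and Composite and the function color_distribution are assumptions introduced here, and the explicit stack-based walk stands in for the traversal that visit performs automatically.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Leaf:
    n: int


@dataclass(frozen=True)
class Composite:
    color: str
    left: "ColoredTree"
    right: "ColoredTree"


ColoredTree = Union[Leaf, Composite]


def color_distribution(t: ColoredTree) -> Dict[str, int]:
    counts: Dict[str, int] = {}      # plays the role of the local map `counts`
    stack = [t]                      # explicit traversal replaces `visit`
    while stack:
        node = stack.pop()
        if isinstance(node, Composite):
            # same effect as `counts[color] ? 0 += 1`
            counts[node.color] = counts.get(node.color, 0) + 1
            stack.extend([node.left, node.right])
    return counts


example = Composite(
    "red",
    Composite("blue", Leaf(1), Leaf(2)),
    Composite("red", Leaf(3), Leaf(4)),
)
print(color_distribution(example))
```

The Python version has to spell out the traversal and the default value, whereas the RASCAL listing expresses both declaratively.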

4 Applications 

We describe three realistic applications of RASCAL to model-driven software development here and one 
example of connecting RASCAL to an existing executable specification. 



• A DSL for Entity modeling (Section 4.1). This educational example is based on our submission
to the Language Workbench Competition 2011 and illustrates modularity, syntax definition, AST
types and code generation.

• ECore [12] (Section 4.2) is a well-known class-based meta-model used in many Eclipse-based
modeling tools such as Kermeta, ATL and XText. This example shows an encoding of ECore
using RASCAL types, in particular the use of relations for DAG-like and cyclic structures.

• Derric (Section 4.3) is a real-world DSL for describing binary file formats and is used in the
digital forensics domain to generate data recovery tools. Despite the small size of their
implementation, the DERRIC-based tools are comparable in functionality and performance to their
industrial-strength counterparts currently used in practice.

• RLSRunner (Section 4.4) is a library plug-in for RASCAL that enables the execution of existing
Maude specifications using a combination of higher-order functions and a co-routine implemented
using pipes. This enables us to reuse existing program analysis specifications instead of requiring
them to be rewritten in RASCAL.

4.1 A simple DSL: Entities 
4.1.1 Concrete and abstract syntax 

Listing 2 Syntax definition of Entity models 

import lang::entities::syntax::Layout;

import lang::entities::syntax::Ident;

import lang::entities::syntax::Types;

start syntax Entities = entities: Entity* entities; 

syntax Entity = entity: "entity" Name name "{" Field* "}"; 

syntax Field = field: Type Ident name; 

The Entities DSL allows you to declare entity types in order to model business objects. An entity has
named fields, which are either primitively typed (integer, string, boolean), or contain a reference to
another entity. An excerpt of the syntax definition of entity models is shown in Listing 2. First, auxiliary
(syntax) modules are imported for defining Layout, Identifiers and Types. An Entity model then consists
of a sequence of zero or more Entity-s. An Entity starts with the keyword entity, followed by a name
and a sequence of zero or more Fields. Finally, a Field consists of a Type (integer, string, boolean or
Entity reference) and a name. The Entities non-terminal is the start symbol of the Entities grammar as
indicated by the start keyword.

^ http://www.languageworkbenches.net

Listing 3 Abstract syntax of entities 

data Entities = entities(list[Entity] entities); 

data Entity = entity(Name name, list[Field] fields); 

data Field = field(Type \type, str name); 

data Type = primitive(PrimitiveType primitive) | reference(Name name); 

data Name = name(str name); 

data PrimitiveType = string() | date() | integer() | boolean() | currency();

Whereas Asf+Sdf allowed only rewriting of concrete syntax trees, RASCAL supports the automatic 
mapping of parse trees to ASTs, using the library function implode. This function converts a parse tree 
to an AST that conforms to a RASCAL ADT describing the abstract syntax. Listing 3 shows an ADT
describing the abstract syntax of Entity models. For every syntax production in the grammar for entities, 
there is a corresponding constructor in this ADT. Every constructor has the same number of arguments 
as the number of symbols in the production (modulo keywords and layout). Lexical tokens are mapped 
to Rascal primitive types. 

Transformation of concrete syntax trees is useful in cases where layout preservation is essential,
such as refactoring or legacy renovation. For MDE applications, however, this is often less important.
As discussed earlier, RASCAL addresses this issue by providing lightweight string templates next
to full-blown source-to-source transformations. Below we describe a simple, template-based Java code
generator for entity models.

4.1.2 Java code generation 

The code generator is shown in Listing 4. It is defined using ordinary RASCAL functions that
produce string values using RASCAL's built-in string templates. As an example, consider the function
entity2java. The string value returned by entity2java uses string interpolation in two ways. First,
the name of the Entity e is directly spliced into the string via the interpolated expression e.name.name
between < and >. Next, the body of the class is produced using an interpolated for-loop. This for-loop
evaluates its body (a string template again) and concatenates the result of each iteration. For each field,
the function field2java is called to generate a field with getter and setter declarations. The single
quote (') acts as margin: all white space to its left is discarded. Furthermore, every interpolated value
is indented automatically relative to this margin. As a result, the output of each consecutive call to
field2java is nicely indented in the class definition. Again, for many cases, this obviates the need for
grammar-based formatters to generate readable code.

The function field2java is an example of pattern-based dispatch, as introduced in Section 3. It is
implemented in a style reminiscent of term rewriting: field2java matches its first parameter against
the pattern field(typ, n). This technique is a powerful tool for implementing languages in a modular
fashion. For instance, the entity language could be extended so that entities support computed
attributes. This would involve adding new production rules to the grammar and new field constructors
to the abstract syntax. Finally, the code generator would have to be extended.
Listing 4 Functions to generate Java source code from Entity models
public str entity2java(Entity e) {
  return "public class <e.name.name> {
         '<for (f <- e.fields) {>
         '  <field2java(f)>
         '<}>
         '}";
}

public str field2java(field(typ, n)) {
  <t, cn> = <type2java(typ), capitalize(n)>;
  return "private <t> <n>;
         'public <t> get<cn>() {
         '  return this.<n>;
         '}
         'public void set<cn>(<t> <n>) {
         '  this.<n> = <n>;
         '}";
}



Using pattern-based dispatch this can be achieved by adding additional field2java declarations that 
match on the new AST constructors. No part of the original code generator has to be modified. 
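
The extension scenario can be imitated outside RASCAL. The Python sketch below is an analogue under stated assumptions, not the RASCAL mechanism itself: functools.singledispatch stands in for pattern-based dispatch, and the classes Field, ComputedField and Entity, the helper capitalize, and the getter-only output for computed attributes are all hypothetical. It shows that supporting a new kind of field only adds a new field2java case and leaves the existing ones untouched.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import List


@dataclass
class Field:
    type_name: str
    name: str


@dataclass
class ComputedField:          # hypothetical extension: a computed attribute
    type_name: str
    name: str
    expression: str


@dataclass
class Entity:
    name: str
    fields: List[object]


def capitalize(name: str) -> str:
    return name[:1].upper() + name[1:]


@singledispatch
def field2java(f) -> str:
    raise TypeError(f"no code generator for {type(f).__name__}")


@field2java.register
def _(f: Field) -> str:
    cn = capitalize(f.name)
    return (
        f"private {f.type_name} {f.name};\n"
        f"public {f.type_name} get{cn}() {{\n"
        f"  return this.{f.name};\n"
        f"}}\n"
        f"public void set{cn}({f.type_name} {f.name}) {{\n"
        f"  this.{f.name} = {f.name};\n"
        f"}}"
    )


# Added later, without modifying any of the code above.
@field2java.register
def _(f: ComputedField) -> str:
    cn = capitalize(f.name)
    return f"public {f.type_name} get{cn}() {{\n  return {f.expression};\n}}"


def entity2java(e: Entity) -> str:
    body = "\n".join(field2java(f) for f in e.fields)
    indented = "\n".join("  " + line for line in body.splitlines())
    return f"public class {e.name} {{\n{indented}\n}}"


person = Entity("Person", [Field("String", "name"),
                           ComputedField("int", "age", "2011 - birthYear")])
print(entity2java(person))
```

Unlike RASCAL's string templates, the Python version has to indent the nested output by hand, which is the job the margin character (') takes over in Listing 4.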

4.1.3 IDE support 

No language can do without IDE support, and this includes DSLs. RASCAL exposes hooks into the 
Eclipse-based IMP [13] framework for dynamically creating IDE support from within RASCAL. These
hooks allow the dynamic registration of, for instance, parsers, type checkers, outliners and reference
resolvers. A screen-shot of the generated IDE for the Entities language is shown in Figure 2. The
generated IDE runs within the RASCAL Eclipse IDE, so the package explorer on the left actually shows 
the source code of the implementation of the Entities DSL. In the middle you see an editor containing 
a simple entity model. It has syntax highlighting and folding which are both based on the context-free 
grammar. As you can see, there is an error: entity Person references an undefined entity Car2. On the 
right an outline is shown detailing the structure of this entity model. Clicking on an outline element 
highlights the corresponding source fragment. At the bottom of the editor pane, (a fragment of) the 
context-menu is shown, including entries to invoke various code generators. 

4.2 Relational meta-modeling: ECore 

Algebraic data types generally do not support expressing structures with sharing and/or cycles. Never- 
theless, in MDE such graph-like structures, especially class-based models, are very common. RASCAL is 
a functional programming language in that all the data is immutable. To deal with graph structures (e.g., 
control-flow graphs, call-graphs, automata, work-flow models etc.) RASCAL provides relations. Rela- 
tions are basically sets of tuples which can be queried using comprehensions. Additionally, RASCAL 
provides built-in support for computing the transitive closure of a binary relation. 

Figure 2: Screen-shot of the dynamically generated IDE for the Entities DSL

As an example of how one might encode a class-based model as a RASCAL data type, Listing 5
shows an excerpt of the ECore meta meta-model. We have used this ADT to successfully import around
300 ECore models. The top-level constructor ecore contains a set of Classifiers, a subtype relation
between Classifiers and a typing relation from Items (scoped in Classifiers) to Types. The last two
constructor arguments capture the essential sharing and/or cyclicity that may be present in a class model.
Since classifiers are identified by a package qualification (Package, not shown) and a name, such values
can be used as indices into the subtype and typing relations.

For instance, assume we have a variable class containing a Class value representing a Person class.
One could then find all super classes of this class using the transitive closure of the subtype relation of
an ECore model e:

class = concrete("Person", [attribute("name", {}, string())]);
for (sup <- e.subtype+[class])
  print("Super: <sup>");
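
To make the query concrete, the following Python sketch mimics what e.subtype+[class] computes: the transitive closure of a binary relation, represented as a set of pairs, followed by a lookup of all right-hand sides for one left-hand value. The function names and the sample class names (Person, Agent, Entity, Car) are illustrative assumptions; in RASCAL both operations are built in.

```python
from typing import Hashable, Set, Tuple

Relation = Set[Tuple[Hashable, Hashable]]


def transitive_closure(rel: Relation) -> Relation:
    """Smallest transitive relation containing rel (naive fixpoint iteration)."""
    closure = set(rel)
    while True:
        # compose the closure with itself: (a, b) and (b, d) give (a, d)
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def image(rel: Relation, x: Hashable) -> Set[Hashable]:
    """All y such that (x, y) is in rel, like the RASCAL subscript rel[x]."""
    return {b for (a, b) in rel if a == x}


subtype = {("Person", "Agent"), ("Agent", "Entity"), ("Car", "Entity")}
for sup in sorted(image(transitive_closure(subtype), "Person")):
    print(f"Super: {sup}")
```

Because the relation is an immutable set of tuples rather than a web of object references, cycles need no special treatment: the fixpoint iteration simply stops when no new pairs appear.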



The encoding shown here is non-trivial and it makes a number of trade-offs and short-cuts with re- 
spect to accurate typing of class models. For instance, the type of the subtype relation allows primitive 
types to be sub- and super-types because they are in fact classifiers; technically this is incorrect. Nev- 
ertheless, making the encoding more strict would also introduce more indirections and hence, introduce 
more case distinctions when processing ECore models. 

Another observation is that the encoding is very convenient for querying, but less than optimal for 
transformation. Model transformation of ECore models encoded this way would entail creating new
ecore values every time a single element is changed, at arbitrary depth in the constructor. Apart from 
not being very efficient, a problem with this approach is the lack of locality: every transformation, no 
matter how small and localized, has to have knowledge of the complete value. We consider this to be an 
area of further research. 
Listing 5 ADT for ECore (excerpt)

data ECore = ecore(set[Classifier] classifiers,
                   rel[Classifier, Classifier] subtype,
                   rel[Classifier, Item, Type] typing);
data Classifier = dataType(Package package, str name) // Package omitted
                | class(Package package, Class class);
data Class = concrete(str name, list[Item] items)
           | interface(str name, list[Item] items)
           | abstract(str name, list[Item] items);
data Item = operation(str name, set[Option] options)
          | parameter(str operation, str name, set[Option] options)
          | attribute(str name, set[Option] options, Type dataType)
          | reference(str name, set[Option] options);
data Type = classifier(Classifier classifier);



4.3 A model-based approach to digital forensics: Derric 

Another MDE application of RASCAL is in the domain of digital forensics. Investigations in this area are 
often related to recovery of deleted, obfuscated, hidden or otherwise difficult to access data. The software 
tools to recover such data require lots of modifications to deal with different variants of file formats, file 
systems, encodings etc. Additionally they are also required to return a result within a reasonable amount 
of time on data sets in the terabyte range. We are investigating a model-driven approach to this problem 
by designing a DSL, DERRIC, to easily express the data structures of interest. From these descriptions 
we generate high performance tools for specific forensic applications. 

Derric is a declarative data description language used to describe complex and large-scale binary 
file formats, such as video codecs and embedded memory layouts. It is essentially a very fine-grained 
grammar formalism to precisely capture the way files are stored. For example, it is possible to define a 
component of a file format to be a 21-bit unsigned integer that is always stored in big endian byte order.
Figure 3 shows an excerpt of a DERRIC specification for the JPEG image file format.

Derric is a language that can be used for many digital forensics applications. Currently, we have 
implemented Derric in RASCAL and have used it to develop a digital forensics data recovery tool 
called Excavator (see Figure 4). Excavator is used for file carving: recovering files from storage
devices without using file system meta-data (these are often unavailable or incomplete). EXCAVATOR is 
implemented as a code generator. It generates a validator that checks whether a series of bytes conforms 
(or might conform) to a certain file format. 
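
To give an impression of what such a validator does, here is a hand-written Python sketch for the DHT structure of Figure 3. It is not output of EXCAVATOR, which generates Java; the function name match_dht, the return convention, and the big-endian reading of the two-byte length field are assumptions made for illustration.

```python
from typing import Optional

DHT_MARKER = bytes([0xFF, 0xC4])


def match_dht(data: bytes, pos: int = 0) -> Optional[int]:
    """Return the offset just past a DHT structure starting at pos, or None."""
    # marker: 0xFF, 0xC4
    if data[pos:pos + 2] != DHT_MARKER:
        return None
    # length: size 2 (assumed to be stored in big endian byte order)
    length_bytes = data[pos + 2:pos + 4]
    if len(length_bytes) < 2:
        return None
    length = int.from_bytes(length_bytes, "big")
    # data: size length - lengthOf(marker)
    data_size = length - len(DHT_MARKER)
    if data_size < 0:
        return None
    end = pos + 4 + data_size
    return end if end <= len(data) else None


sample = bytes([0xFF, 0xC4, 0x00, 0x05, 1, 2, 3])
print(match_dht(sample))
```

A carver can slide such a function over a disk image and chain the validators of consecutive structures to decide whether a byte range is a plausible file.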

The steps in the implementation of EXCAVATOR are shown in Figure 4. The first step consists of
parsing the DERRIC source text and converting the parse tree to an AST (implode). This AST is the 
starting point of a series of refinements where each step takes a complete AST as input and produces a 
modified AST (of the same type) that is better suited to the final goal of generating a validator for the 
described format. 

One refinement consists of annotating the AST with derived values required in later stages. For 
instance, such values could include size and offset information, used by compile-time expressions in the 
descriptions (e.g., lengthOf and offset on lines 4 and 5 in Figure 3).

Another refinement consists of simplifying the AST by performing compiler optimizations such as
constant folding and propagation (e.g., replacing string constants such as the one on line 6 in Figure 3
by a list of bytes corresponding to the string in the defined encoding, so that the generated code only
has to do simple byte comparisons).
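
A refinement of this kind can be pictured as a function from AST to AST. The Python sketch below uses a deliberately tiny, hypothetical AST (StrConst, IntConst, ByteList, FieldSpec) and assumes ASCII as the defined encoding; it is not the DERRIC implementation. It folds the constant parts of one field specification into a single byte list and leaves the input value unchanged.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class StrConst:
    value: str


@dataclass(frozen=True)
class IntConst:
    value: int


@dataclass(frozen=True)
class ByteList:
    values: bytes


Expr = Union[StrConst, IntConst, ByteList]


@dataclass(frozen=True)
class FieldSpec:
    name: str
    parts: List[Expr]


def fold_constants(field: FieldSpec, encoding: str = "ascii") -> FieldSpec:
    """Return a new FieldSpec whose constant parts are merged into one ByteList."""
    folded = b""
    for part in field.parts:
        if isinstance(part, StrConst):
            folded += part.value.encode(encoding)
        elif isinstance(part, IntConst):
            folded += bytes([part.value])   # assumes a single-byte constant
        else:
            folded += part.values
    return FieldSpec(field.name, [ByteList(folded)])


identifier = FieldSpec("identifier", [StrConst("JFIF"), IntConst(0)])
print(fold_constants(identifier))
```

After this step a generated validator can compare the input against one fixed byte sequence instead of decoding a string at run time.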
 1 structures
 2 APP0JFIF {
 3   marker: 0xFF, 0xE0;
 4   length: lengthOf(rgb) + (offset(rgb)
 5     - offset(identifier)) size 2;
 6   identifier: "JFIF", 0;
 7   version: expected 1, 2;
 8   units: | 1 | 2;
 9   xthumbnail: size 1;
10   ythumbnail: size 1;
11   rgb: size xthumbnail * ythumbnail * 3;
12 }
13
14 DHT {
15   marker: 0xFF, 0xC4;
16   length: size 2;
17   data: size length - lengthOf(marker);
18 }
19
20 SOS = DHT {
21   marker: 0xFF, 0xDA;
22   compressedData:
23     unknown terminatedBefore 0xFF, !0x00;
24 }

Figure 3: Excerpt of JPEG in Derric
Figure 4: Use of DERRIC in EXCAVATOR

Figure 5: Sizes of the EXCAVATOR components



Finally, the Derric implementation supports a number of optional refinements, which can be exe- 
cuted on demand. An example is to replace parts of the AST with alternatives that result in code that is 
either faster or more precise. As such, this allows users to configure the trade-off between accuracy and 
runtime performance on a case-by-case basis. 

The final step of Figure 4 consists of generating code. We currently have a code generator that
generates Java code and as a result, all Derric types are annotated with a target type that maps cleanly 
onto Java types. For instance, a 32-bit unsigned type will be stored in a 64-bit signed type since Java 
does not support unsigned types. The resulting code is then loaded by the EXCAVATOR runtime system 
to recover files from disk images. 
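
The need for a wider target type can be illustrated with a small Python sketch. Python integers are unbounded, so the standard struct module is used here to mimic fixed-width Java types; the helper name read_u32_as_long, the sample bytes and the big-endian byte order are assumptions chosen for the example.

```python
import struct


def read_u32_as_long(raw: bytes) -> int:
    """Zero-extend four big-endian bytes and read them as a signed 64-bit value."""
    return struct.unpack(">q", b"\x00" * 4 + raw)[0]


raw = bytes([0xFF, 0xFF, 0xFF, 0xFE])          # a 32-bit value with the top bit set

as_unsigned_32 = struct.unpack(">I", raw)[0]   # the value the format describes
as_signed_32 = struct.unpack(">i", raw)[0]     # what a 32-bit signed type would hold
as_signed_64 = read_u32_as_long(raw)           # what a 64-bit signed type holds

print(as_unsigned_32, as_signed_32, as_signed_64)
```

Only the 64-bit signed reading agrees with the unsigned value for every input, which is why the wider type is the clean mapping.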

To evaluate EXCAVATOR, we have compared it to three industrial-strength carving tools on a set
of standard benchmarks [5]. Our evaluation shows that even though the implementation is very small
(see Figure 5), it performs as well as the competing tools both in terms of functionality and runtime
performance, while providing a much higher level of flexibility to the user.



4.4 Rascal front-ends for K program analysis semantics 

The K [32] semantics framework is an executable framework for defining the semantics of programming
languages. Semantics for program analysis can be defined similarly to those for standard evaluation, with
rules evaluating program constructs over abstract value domains. Work in this area includes matching
logic [31], a logic with similar goals to separation logic [29], and analysis policies, where a policy is
defined as a combination of a generic analysis semantics for a language, specific extensions for each
specific analysis, and an annotation language tailored to each analysis [21].

Figure 6: Integrating RASCAL with K Definitions in Maude

One limitation of this work is that it has focused on the semantics, but not on the entire tool chain. 
This means that the process of transforming the input program into something that can be evaluated in 
a K semantics, or of taking the results and providing them to the user of the analysis in some useful 
form, has always been approached in an ad-hoc fashion. While not a theoretical problem, this makes the 
analyses much less useful in practice. 

The RLSRunner tool [20] provides a solution to this for K definitions compiled to run in Maude [15],
a language and engine for defining, evaluating, and reasoning about rewriting logic [28] specifications.
An overview of the RLSRunner integration with Maude is shown in Figure 6. First, using RASCAL,
one defines the grammar for the language being analyzed, which is used to generate a parser for the 
language. As shown earlier, this automatically provides for a basic IDE for the language. A maudeifier 
is then defined, allowing the parse tree of the program to be analyzed to be transformed into a prefix 
form easily consumable by Maude. This prefix form also includes location information, encoded using 
a K definition for locations, which can be used to tag errors with the location of the offending construct. 
RLSRunner library functions are then used to register handlers both for preparing the term to be given to 
Maude and for parsing the term resulting from evaluation. Other RLSRunner functions allow RASCAL to 
interact with Maude and with the Eclipse environment, allowing error information returned as a result of 
the analysis to be shown in the IDE and in the Problems view. An example of showing this information 
in Eclipse is provided in Figure 7 for a units of measurement analysis in a simple imperative language.

Figure 7: Units Arithmetic Error, Shown in Eclipse

5 Concluding Remarks

We have presented the lessons we learned going from algebraic specification languages to algebraic 
programming languages. These lessons were input for the design rationale for the RASCAL language: a 
domain-specific programming language for meta-programming. RASCAL is easy to teach, learn and use
in the domain of meta-programming, but is still true to its algebraic specification roots.